Abstract

Detecting outliers in a large data set is a major data
mining task. Existing approaches fall into two main
categories: distance-based and density-based outlier
detection. Although Local Outlier Factor (LOF) is
considered the most popular density-based algorithm,
it still has problems related to speed and accuracy,
and enhancing it has been the focus of many
researchers in this field. Among the improved versions
of LOF, GridLOF has been shown to perform very well.
This paper presents an enhancement to the GridLOF
algorithm that replaces one of its steps with a less
complex step, reducing the complexity from O(N^2) to
O(N). The simulation results show that the proposed
algorithm outperforms the GridLOF algorithm in terms
of speed and accuracy.

Keywords: Outlier, outlier detection, data mining, 
LOF, GridLOF. 

Nomenclature 


LOF     Local Outlier Factor
LOCI    Local Correlation Integral
KLOF    Kernel Density-Based Local Outlier Factor
PSNR    Peak signal-to-noise ratio


1. INTRODUCTION

Big data is a term used to describe the exponential 
growth of data, both structured and unstructured. In 
this case the data is so large or complex that traditional 
data processing techniques are inadequate. Big Data is 
mostly known to have three characteristics [1]: 
volume, variety, and velocity. Many factors contribute 
to the increase in data volume. These factors include
transaction-based data stored over the years,
unstructured data from social media, and increasing
amounts of sensor and machine-to-machine data being
collected. Velocity means that data must be dealt with
in a timely manner. As technology evolves very
rapidly, dealing with torrents of data in real time
becomes a challenge for most organizations. Finally,
as data comes in all types of formats, managing
different varieties of data is very important to consider
when dealing with big data.

In order to discover patterns, unknown correlations, 
and other useful information in large data sets, big data 
analytics are employed [2]. Data mining is the most 
popular technique for data analytics [3]. This can be 
used in many areas such as sales, marketing, finance, 
medicine, supply chain management, and 
manufacturing. The process of mining patterns from
data requires some preliminary operations, including
data preparation, cleaning, and cleansing [4].
These operations detect and remove errors and
inconsistencies (outliers) from the data in order
to improve its quality. When multiple data sources
are integrated in big data repositories such as data
warehouses, the need for data cleaning increases
significantly. Important issues in data preparation
are discussed in [5].

An outlier is defined in [6] as an observation that
deviates greatly from other observations. In [7], an
outlier is defined as an observation that appears to
be inconsistent with the other members of the same
data set. In [8], the terms outlying observation and
outlier are used synonymously and are defined as an
observation that appears to deviate markedly from the
other members of the data set in which it occurs.

Outlier detection is the process of finding data
objects whose behavior is very different from what is
expected (outliers) [9]. Outlier detection methods
are categorized into three types [3, 9, 10]: statistical
methods, proximity-based methods, and clustering-based
methods. The work proposed in this paper is a
proximity-based method. There are two types of
proximity-based outlier detection methods:
distance-based and density-based. A distance-based
outlier detection method consults the neighborhood of
an object, which is defined by a given radius [10]. An
object is then considered an outlier if its neighborhood
does not contain enough other points. A density-based
outlier detection method [11] investigates the density
of an object and that of its neighbors. Here, an object
is identified as an outlier if its density is much lower
than that of its neighbors.

Local Outlier Factor (LOF) is considered the most
popular density-based outlier detection method [7, 12].
The main idea of LOF, as proposed in [13], is to use the
density of an object relative to its neighbors to assign
the object a degree of being an outlier. This degree is
called the local outlier factor (LOF). Although there
are many research efforts in the literature on
simplifying and enhancing the LOF algorithm
[14, 15, 16], more enhancements are needed to deal
with big data. LOF', LOF'', GridLOF [17], and
FastLOF [18] are examples of these efforts.

This paper proposes a new outlier detection
algorithm based on an enhancement of the LOF
algorithm. Despite the effort made to enhance LOF,
it still suffers from a complexity problem that
appears clearly when dealing with big data. To
improve the performance of LOF, the proposed
algorithm focuses on simplifying the step of finding
the nearest neighbors, which is the major bottleneck
in this algorithm. To demonstrate the effectiveness
of the proposed algorithm, execution time is chosen
as the performance metric for comparison against
the current algorithms.

The remainder of this paper is structured as follows.
Section 2 reviews related work on the outlier
detection problem. Section 3 gives background on the
LOF algorithm. Section 4 presents the details of the
proposed algorithm. Section 5 presents the simulation
results and an analytical discussion of the
performance of the proposed algorithm. Section 6
presents the conclusions and future work.


2. RELATED WORK 

The importance of outlier detection comes from the
fact that the deduced data can be translated into
actionable information for many applications. These
applications include, but are not restricted to, fraud
detection for credit cards, control systems, medical
research, communication at runtime software, image
sharing [32], intelligent transportation systems,
wireless sensor networks, and even human skin
detection.

Extensive research effort in the literature has
tackled outlier detection. In [19], an overview of the
existing outlier detection techniques in the database
management and data mining fields is provided, and
the techniques are classified along different
dimensions. Generally, distance-based [10] and
density-based [11] methods are the most important
methods for outlier detection. The major difference
between the two is the granularity level [9], as
distance-based methods give a higher level of
granularity.

Distance-based methods are based on nearest
neighbor distances [9]. A point is considered an
outlier if its k-nearest neighbor distances are larger
than those of normal points. The detailed granularity
level of distance-based methods gives more capability
in outlier analysis, but at a higher computational cost
[9]. There are many approaches for outlier detection in
this category, such as the cell-based approach [9], the
nested loop approach [3], and the reverse nearest
neighbor approach [9].

In real-world data sets, the structure of the data is
more complex, as outliers are determined by their local
neighborhoods or by the global data distribution [3].
Getting outliers by global data distribution is called
density-based outlier detection. Thus, an outlier is
detected by considering the density around each point
in the data set. There are many approaches for outlier
detection in this category, such as Local Correlation
Integral (LOCI) [9], histogram-based approaches [9],
kernel density estimation approaches [9], and Local
Outlier Factor (LOF) [9, 13]. Besides being the most
popular density-based outlier detection approach, LOF
is one of the simplest approaches for outlier detection
[19].

In [13], each object is assigned a degree of being an
outlier, called the local outlier factor (LOF) of the
object. This degree depends on how isolated the object
is with respect to its surrounding neighborhood. The
LOF algorithm has been used in many applications such
as cloud computing [20], network outlier detection
[21], data clustering [22], fault analysis and
detection [23], and steel plates fault diagnosis [24].

Many efforts toward enhancing the LOF algorithm
have been seen in the literature. Kernel Density-Based
Local Outlier Factor (KLOF) is an outlier detection
algorithm based on LOF [25]. In [26], a hierarchical
framework using approximated LOF is used for
efficient anomaly detection. Also, an enhanced
approach for LOF is proposed for data mining purposes
in [27]. The complexity of finding the nearest
neighbors in the LOF algorithm is O(N^2), whereas the
complexity of the rest of the algorithm is O(N). Thus,
many researchers have tried to skip the step of finding
the nearest neighbors in the LOF algorithm. Along this
path, LOF', LOF'', and GridLOF have been proposed in
[17]. GridLOF is the most efficient and adaptive
algorithm for calculating the LOF value of each data
object in the data set. GridLOF also increases
accuracy, as it avoids some false identifications that
may occur in LOF. FastLOF has also been proposed in
[18] to speed up the LOF computation. This is done by
randomly dividing the data set into groups. For each
group, LOF is calculated, and the points with LOF
values greater than a defined threshold are identified
(the threshold is initially selected between 1.0 and
2.0). This process is repeated to find better
neighbors. Although the FastLOF algorithm [18] can
find outliers in any data set, it cannot find all
neighbors of a point because the data set is divided
randomly into groups.

3. BACKGROUND

As the work proposed in this paper is based on
enhancing the LOF algorithm, some details about LOF
and its enhanced version GridLOF are provided in this
section. In [13], LOF is computed with the steps shown
in Figure 1.



Figure 1: LOF Calculation Steps

1- Get the k-nearest neighbors for an object p.

2- Get the k-distance for an object p.

3- Get the reachability distance of an object p with
   respect to its k-nearest neighbors:

\[ \text{reach-dist}_k(p, o) = \max\{k\text{-distance}(o),\ d(p, o)\} \tag{1} \]

   where d(p, o) is the distance between object p and
   its neighbor object o.

4- Get the local reachability density of an object p:

\[ lrd_{MinPts}(p) = 1 \Big/ \left( \frac{\sum_{o \in N_{MinPts}(p)} \text{reach-dist}_{MinPts}(p, o)}{|N_{MinPts}(p)|} \right) \tag{2} \]

   This is based on the minimum points (MinPts), which
   is the number of nearest neighbors of p.

5- Get the LOF for an object p, as shown in figure 2, by:

\[ LOF_{MinPts}(p) = \frac{\sum_{o \in N_{MinPts}(p)} \frac{lrd_{MinPts}(o)}{lrd_{MinPts}(p)}}{|N_{MinPts}(p)|} \tag{3} \]
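
The five steps of Figure 1 can be sketched in a few lines of Python. This is a minimal brute-force illustration, not the implementation used in the experiments: the function name `lof_scores` is hypothetical, points are assumed to be 2-D tuples, and the data is assumed to contain no group of more than k duplicate points (which would make a reachability density infinite).

```python
import math

def lof_scores(points, k):
    """Brute-force LOF for a list of (x, y) tuples; O(N^2) distance work."""
    n = len(points)
    # Pairwise Euclidean distances d(p, o).
    dist = [[math.dist(p, o) for o in points] for p in points]

    neighbors, k_distance = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        kd = dist[i][order[k - 1]]  # step 2: k-distance of p
        # Step 1: every point within the k-distance (ties are included).
        neighbors.append([j for j in order if dist[i][j] <= kd])
        k_distance.append(kd)

    def reach_dist(i, j):
        # Step 3, Eq. (1): reach-dist_k(p, o) = max{k-distance(o), d(p, o)}.
        return max(k_distance[j], dist[i][j])

    # Step 4, Eq. (2): inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        mean_reach = sum(reach_dist(i, j) for j in neighbors[i]) / len(neighbors[i])
        lrd.append(1.0 / mean_reach)

    # Step 5, Eq. (3): mean ratio of the neighbors' lrd to the point's own lrd.
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]
```

Points inside a uniform cluster receive a score close to 1, while an isolated point receives a much larger score. The nested distance computation in the first line of the function body is the O(N^2) step that the enhancements discussed in this paper try to avoid.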



In [17], the GridLOF algorithm is proposed as an
adaptive algorithm that prunes away the portion of the
dataset known to be non-outliers. First, each
dimension of the data space is quantized into
equal-width intervals, resulting in a grid-based
structure. Then, for each non-empty grid cell c, the
neighboring grid cells are examined, and c is labeled
as a boundary cell once a neighboring grid cell is
found that contains a number of points less than or
equal to the pre-defined threshold (a). Here, a is a
relatively small number; in the extreme case, it can
be set to zero. Finally, only the LOF values of the
points inside boundary cells are calculated, using the
five LOF steps described above. Figure 3 illustrates
the idea of the GridLOF algorithm.
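
The pruning step described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the code of [17]: the data is taken to be two-dimensional, the "neighboring grid cells" are taken to be the eight cells surrounding c, and the names `boundary_points`, `cell_width`, and `threshold` are hypothetical.

```python
from collections import defaultdict
from itertools import product

def boundary_points(points, cell_width, threshold=0):
    """Return the points lying in boundary cells of a uniform 2-D grid."""
    # Quantize each dimension into equal-width intervals.
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_width), int(y // cell_width))].append((x, y))

    kept = []
    for (cx, cy), members in cells.items():
        for dx, dy in product((-1, 0, 1), repeat=2):
            if (dx, dy) == (0, 0):
                continue
            # A missing key is an empty neighboring cell (0 points).
            if len(cells.get((cx + dx, cy + dy), ())) <= threshold:
                kept.extend(members)  # c is a boundary cell
                break
    return kept
```

Only the points returned by this function would then be passed to the LOF computation; points in cells that are fully surrounded by populated cells are treated as non-outliers and skipped.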



4. THE PROPOSED APPROACH 

In the GridLOF algorithm, the portion of the dataset
known to be non-outliers is pruned away [17]. The local
outlier factor (LOF) of the remaining points is then
calculated. Although the overall cost of computing LOF
is reduced, GridLOF still has a complexity of O(N^2).

The algorithm proposed in this paper replaces the
first step of the original LOF algorithm, which gets the
k-nearest neighbors of an object p, with another method
that gets the nearest neighbors in a more efficient
and novel manner. The proposed work is based on
re-defining each point P_i by two values (w, θ),
measured relative to an index point P_0, instead of the
normal coordinates (x, y), where:

Figure 4: Re-defining each point by (w, θ)


w_{i-0} is the distance between P_i and P_0 (as shown in
Figure 4), which is computed as:

\[ w_{i-0} = \sqrt{(x_i - x_0)^2 + (y_i - y_0)^2} \tag{4} \]

θ_{i-0} is the angle between line P_0 P_i and the X axis,
which is computed as:

\[ \theta_{i-0} = \tan^{-1}\left(\frac{y_i - y_0}{x_i - x_0}\right) \tag{5} \]

For each point, a circle is drawn with radius R. The
equations representing all the points within the
circle are deduced in terms of w, θ, and R only. To
get these equations, we need to define the following
variables, as shown in Figure 5:

φ_{i-0} is the angle between line P_0 P_i and the circle's
tangent from P_0, which is computed as:

\[ \varphi_{i-0} = \sin^{-1}\left(\frac{R}{w_{i-0}}\right) \tag{6} \]

w^-_{i-0} is the minimum distance between P_0 and the
circle, which is computed as:

\[ w^{-}_{i-0} = w_{i-0} - R \tag{7} \]

w^+_{i-0} is the maximum distance between P_0 and the
circle, which is computed as:

\[ w^{+}_{i-0} = w_{i-0} + R \tag{8} \]

θ^-_{i-0} is the minimum angle between the circle's
tangent from P_0 and the X axis, which is computed as:

\[ \theta^{-}_{i-0} = \theta_{i-0} - \varphi_{i-0} \tag{9} \]

θ^+_{i-0} is the maximum angle between the circle's
tangent from P_0 and the X axis, which is computed as:

\[ \theta^{+}_{i-0} = \theta_{i-0} + \varphi_{i-0} \tag{10} \]

Then, the hashed area in Figure 5 can be described by
the following two equations:

\[ w^{-}_{i-0} \le w_{n-0} \le w^{+}_{i-0} \tag{11} \]

\[ \theta^{-}_{i-0} \le \theta_{n-0} \le \theta^{+}_{i-0} \tag{12} \]

Thus, a point P_n lies in the hashed area if it satisfies
equations 11 and 12.
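
The re-definition and the membership test of equations 4 through 12 can be sketched as below for a single index point. The function names are hypothetical. Two assumptions are made that the text does not state: `math.atan2` is used for the angle so that the quadrant is resolved automatically, and the index point is assumed to lie outside the circle (R < w) and to be placed so that the angles of the data points do not wrap around ±π.

```python
import math

def polar(point, index_point):
    """(w, theta) of `point` relative to `index_point`, Eqs. (4) and (5)."""
    dx, dy = point[0] - index_point[0], point[1] - index_point[1]
    return math.hypot(dx, dy), math.atan2(dy, dx)

def sector_bounds(w_i, theta_i, radius):
    """Bounds of Eqs. (7)-(10); assumes radius < w_i so Eq. (6) is defined."""
    phi = math.asin(radius / w_i)  # Eq. (6)
    return w_i - radius, w_i + radius, theta_i - phi, theta_i + phi

def in_sector(w_n, theta_n, bounds):
    """True when (w_n, theta_n) satisfies Eqs. (11) and (12)."""
    w_min, w_max, theta_min, theta_max = bounds
    return w_min <= w_n <= w_max and theta_min <= theta_n <= theta_max
```

For example, with the index point at the origin, a circle of radius 2 centered at (10, 10) accepts the point (11, 10), which lies inside the circle. It also accepts points near the corners of the hashed area that lie slightly outside the circle, since the two inequalities describe an annular sector that encloses the circle rather than the circle itself.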



Figure 5: Circle around each point in the proposed algorithm 


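In order to define the equation of the circle of center
P shown in Figure 5, we use four index points (P_0, P_1,
P_2, and P_3), as shown in Figure 6, instead of one
point, and thus:

\[ w^{-}_{i-0} \le w_{n-0} \le w^{+}_{i-0} \tag{13} \]

\[ \theta^{-}_{i-0} \le \theta_{n-0} \le \theta^{+}_{i-0} \tag{14} \]

\[ w^{-}_{i-1} \le w_{n-1} \le w^{+}_{i-1} \tag{15} \]

\[ \theta^{-}_{i-1} \le \theta_{n-1} \le \theta^{+}_{i-1} \tag{16} \]

\[ w^{-}_{i-2} \le w_{n-2} \le w^{+}_{i-2} \tag{17} \]

\[ \theta^{-}_{i-2} \le \theta_{n-2} \le \theta^{+}_{i-2} \tag{18} \]

\[ w^{-}_{i-3} \le w_{n-3} \le w^{+}_{i-3} \tag{19} \]

\[ \theta^{-}_{i-3} \le \theta_{n-3} \le \theta^{+}_{i-3} \tag{20} \]

Figure 6: Defining a circle of center P with four index points (P_0,
P_1, P_2, and P_3)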

A point P_n lies in the hashed circle if it
satisfies equations 13 through 20. To use this idea to
get the k-nearest neighbors, a circle with initial
radius R is drawn around each point P in the dataset.
The number of points in this circle is deduced using
the (w, θ) values, which are calculated only once with
respect to each index point using equations 4 and 5.
The radius of the circle is then increased until the
number of points in the circle equals the required
number of neighbors, k.
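
The filtering for one value of R with four index points can be sketched as follows. The text does not say where the index points are placed, so the helper `corner_index_points`, which puts them just outside the corners of the bounding box of the data, is purely an assumption made for this sketch (it keeps all angles seen from one index point within a single quadrant). The other names are hypothetical as well. When R is not smaller than w for some index point, Eq. (6) is undefined and the sketch applies only the distance bounds for that index point.

```python
import math

def polar(point, index_point):
    """(w, theta) of `point` relative to `index_point`."""
    dx, dy = point[0] - index_point[0], point[1] - index_point[1]
    return math.hypot(dx, dy), math.atan2(dy, dx)

def corner_index_points(points, margin=1.0):
    """Assumed placement: just outside the bounding-box corners."""
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    lo_x, hi_x = min(xs) - margin, max(xs) + margin
    lo_y, hi_y = min(ys) - margin, max(ys) + margin
    return [(lo_x, lo_y), (hi_x, lo_y), (lo_x, hi_y), (hi_x, hi_y)]

def precompute(points, index_points):
    """table[j][n] holds (w, theta) of point n relative to index point j."""
    return [[polar(p, idx) for p in points] for idx in index_points]

def candidates(i, radius, table):
    """Indices n != i that satisfy Eqs. (13)-(20) around point i."""
    found = []
    for n in range(len(table[0])):
        if n == i:
            continue
        for column in table:
            w_i, theta_i = column[i]
            w_n, theta_n = column[n]
            if not (w_i - radius <= w_n <= w_i + radius):
                break
            if radius < w_i:  # Eq. (6) is only defined when R < w
                phi = math.asin(radius / w_i)
                if not (theta_i - phi <= theta_n <= theta_i + phi):
                    break
        else:
            found.append(n)  # all four index points accepted point n
    return found
```

The table is built once for the whole dataset; a caller would then invoke `candidates` with a growing radius until enough points are returned, and rank the returned points by their true distance to keep the k nearest.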

However, equations 13 through 20 do not represent
the circle accurately when a larger number of
neighbors is required (the equations represent the
hashed shape in Figure 5). Therefore, if the nearest
ten neighbors are needed, for example, twelve
neighbors are retrieved to ensure that the required
ten neighbors lie inside the circle and not merely
inside the hashed shape. Thus, two neighbors are added
as an error margin.

The complexity of the proposed algorithm is only
O(N), whereas that of the GridLOF algorithm is O(N^2)
[17]. In GridLOF, to find the nearest neighbors of a
point, the distance between this point and every other
point in the dataset is calculated. Thus, to get the
nearest neighbors for the whole dataset, N * N
calculations are performed (where N is the size of the
dataset). In the proposed algorithm, by contrast,
(w, θ) is calculated only once for each point with
respect to each index point, and the radius R is
increased until the needed neighbors are obtained.
Thus, only N calculations are done.
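
As a rough worked example of these counts, take N = 60,000 (the smallest dataset size used in Section 5) and the four index points of Figure 6. Counting only the evaluations named in the paragraph above, that is, pairwise distances for GridLOF and (w, θ) pairs for the proposed step:

```latex
\[
N \times N = (6 \times 10^{4})^{2} = 3.6 \times 10^{9}
\qquad \text{versus} \qquad
4N = 4 \times 6 \times 10^{4} = 2.4 \times 10^{5}
\]
```

The second count grows linearly with N, since the number of index points is a constant.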


5. SIMULATIONS AND RESULTS 

Our experiments are performed using a Windows
application written in C# running on Windows 7 with an
Intel® Core™ i7-3630QM @ 2.4 GHz processor and
16 GB RAM. Datasets are generated using R [28], with
sizes between 60K and 1000K and outlier percentages of
0.1% to 0.5% of the dataset. Table 1 shows the
different datasets. The implementation uses grid sizes
of 0.34*0.34 for the 60K datasets, 0.5*0.5 for the 125K
datasets, 0.75*0.75 for the 500K datasets, and 0.9*0.9
for the 1000K datasets.

In these experiments, we examine the use of the
proposed step in the GridLOF algorithm. The main
problem in the implementation is how to increase the
radius of the circle. We start with an initial value of
R = 2 and increase it by 2 (R += 2) in each iteration.
Some points in the dataset (such as outliers) need many
iterations to get their nearest neighbors. Thus, we
limit the number of iterations (10 iterations is
chosen) in order to save time, after which we switch to
the normal GridLOF algorithm.
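
The radius schedule with its iteration limit can be sketched as a small driver. This is a hedged illustration only: `candidates_at` and `fallback` are hypothetical callables standing in for the (w, θ) filter and for the normal GridLOF neighbor search, and the default `slack=2` reflects the two extra neighbors mentioned in Section 4.

```python
def neighbors_with_cap(candidates_at, fallback, k, slack=2,
                       r0=2.0, step=2.0, max_iter=10):
    """Grow R from r0 by `step` until k + slack candidates are found.

    `candidates_at(radius)` returns the candidate neighbors for a radius;
    `fallback()` is the exhaustive search used once `max_iter` radii
    have been tried without success.
    """
    radius = r0
    for _ in range(max_iter):
        found = candidates_at(radius)
        if len(found) >= k + slack:
            return found, radius
        radius += step
    return fallback(), None
```

With the defaults, the radii tried are 2, 4, ..., 20. A point that still has too few candidates at R = 20, as an outlier typically would, is handed to the fallback search, so the cost of the extra iterations is bounded.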

Table 1 shows a comparison between GridLOF with
and without the proposed modification, using the
execution time as the assessment metric. We varied the
size of the dataset and the number of outliers in it to
test the performance under different conditions. The
last column of the table shows the improvement made by
the proposed algorithm over the traditional GridLOF
algorithm in terms of execution time for all test
cases.


Table 1: Execution time results

Execution times are in seconds. Improvement (%) = (difference) / (GridLOF) %.

Size    Outlier %   K    GridLOF (s)    Modified GridLOF (s)   Improvement (%)
60K     0.1%        10   240.628763     222.1957088            7.66
60K     0.1%        15   248.961240     227.5250138            8.61
60K     0.1%        21   252.651451     232.0782741            8.14
60K     0.2%        10   257.855749     239.8187169            7.00
60K     0.2%        15   260.793917     245.9040646            5.71
60K     0.2%        21   267.649309     249.8922932            6.63
60K     0.5%        10   240.403750     228.5240708            4.94
60K     0.5%        15   244.709997     230.5451863            5.79
60K     0.5%        21   249.880292     235.0324430            5.94
125K    0.1%        10   510.830218     488.4689388            4.38
125K    0.1%        15   516.922567     491.7331255            4.87
125K    0.1%        21   524.575004     494.8103017            5.67
125K    0.2%        10   359.346554     339.7514326            5.45
125K    0.2%        15   362.343725     341.9735596            5.62
125K    0.2%        21   370.563195     348.1349121            6.05
125K    0.5%        10   575.717929     554.9917438            3.60
125K    0.5%        15   616.376255     562.2411583            8.78
125K    0.5%        21   647.162016     566.5004021            12.46
500K    0.1%        10   1754.697363    1631.817335            7.00
500K    0.1%        15   1815.866862    1635.92757             9.91
500K    0.1%        21   1818.568016    1641.982916            9.71
500K    0.2%        10   1902.238802    1798.229853            5.47
500K    0.2%        15   1906.350037    1803.315144            5.40
500K    0.2%        21   1891.903211    1812.294657            4.21
500K    0.5%        10   1350.35752     1275.496954            5.54
500K    0.5%        15   1363.03296     1285.336517            5.70
500K    0.5%        21   1378.05982     1292.721940            6.19
1000K   0.5%        10   2886.046073    2759.082811            4.40
1000K   0.5%        15   2976.336237    2783.228192            6.49
1000K   0.5%        21   2985.340718    2810.691762            5.85
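
The improvement column follows directly from the two timing columns. A minimal sketch of that calculation, using a hypothetical helper name and the first row of Table 1 as input:

```python
def improvement_percent(gridlof_seconds, modified_seconds):
    """(difference) / (GridLOF), expressed as a percentage."""
    return (gridlof_seconds - modified_seconds) / gridlof_seconds * 100.0

# First row of Table 1: 60K points, 0.1% outliers, K = 10.
print(round(improvement_percent(240.628763, 222.1957088), 2))  # 7.66
```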


Figure 7 shows the execution time (in seconds) for
both GridLOF and the proposed modified GridLOF for
datasets of size 60K with outlier percentages ranging
from 0.1% to 0.5%. The results show that the execution
time increases as the number of nearest neighbors (K)
used in calculating the LOF value of each point
increases. However, the execution time of the proposed
modified GridLOF remains lower than that of the normal
GridLOF.

Figure 8 shows the execution time (in seconds) for
both GridLOF and the proposed modified GridLOF for
datasets of size 125K with outlier percentages ranging
from 0.1% to 0.5%. The results show that the proposed
modified GridLOF algorithm still outperforms the
normal GridLOF algorithm.

Figure 7: Comparison of execution time of GridLOF and
Modified GridLOF at 60K dataset



Figure 8: Comparison of execution time of GridLOF and 
Modified GridLOF at 125K Dataset 


Figure 9 shows the superiority of the proposed
modified GridLOF algorithm over the normal GridLOF
algorithm for datasets of size 500K with outlier
percentages from 0.1% to 0.5%.


Figure 10 shows the superiority of the proposed
modified GridLOF algorithm over the normal GridLOF
algorithm for the dataset of size 1000K with an outlier
percentage of 0.5%.


Figure 9: Comparison of execution time of GridLOF and
Modified GridLOF at 500K Dataset

Figure 10: Comparison of execution time of GridLOF and
Modified GridLOF at 1000K Dataset

Figures 7, 8, 9, and 10 show that the proposed
modified GridLOF algorithm performs better than the
normal GridLOF in all cases covered in the
experiments, and this is expected to continue as the
data size, the outlier percentage, or even the value
of K (k-nearest neighbors) increases.

To check the performance of the proposed algorithm
in terms of accuracy, further experiments are
conducted. The proposed algorithm is used to detect
and remove noise in images using a median filter. The
steps for applying the proposed algorithm to images
are shown in Figure 11.

The value of φ depends on the level of useful
information to be preserved in the noisy image. In our
experiments, φ = 0 is used. The proposed algorithm is
compared with the normal median filter [29] using
PSNR (peak signal-to-noise ratio). Figure 12 shows the
original image used and the image with 5% noise.




Figure 12: image 1 (a) original one (b) with 5% Salt & Pepper 
noise 

Figure 13 shows the results of filtering the noisy
image (with a 5% noise level) using the normal median
filter and the proposed algorithm. The same
experiments are repeated with 10% and 20% noise
levels; the PSNR is calculated in each case, and the
results are shown in Figure 14.
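
For reference, PSNR is computed from the mean squared error (MSE) between the original and the filtered image as 10 · log10(MAX^2 / MSE). The sketch below is a minimal illustration that assumes 8-bit grayscale images (MAX = 255) stored as lists of rows; the function name and the image representation are hypothetical and are not taken from the implementation used in the experiments.

```python
import math

def psnr(reference, test, max_value=255):
    """PSNR in dB between two equally sized grayscale images (lists of rows)."""
    squared_error, count = 0.0, 0
    for ref_row, test_row in zip(reference, test):
        for a, b in zip(ref_row, test_row):
            squared_error += (a - b) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A higher PSNR means the filtered image is closer to the original, which is why it serves here as a measure of how well the noisy pixels were detected and replaced.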




Figure 13: Image 1 (5% noise) after applying the median filter: (a)
proposed algorithm, (b) normal median filter


[Bar chart: PSNR for Image 1 at noise levels of 5%, 10%, and 20%,
comparing the Normal Filter and the Proposed Algorithm]

Figure 14: PSNR for image 1 with 5%, 10%, & 20% noise

Figure 14 shows that the proposed algorithm has a
higher PSNR at the 5%, 10%, and 20% noise levels, which
indicates that the proposed modified GridLOF
algorithm detects outliers (noisy pixels) more
accurately than the normal median filter in these
experiments.

6. CONCLUSIONS 

Outlier detection is used in many applications
such as fraud detection for credit cards, control
systems, medical research, image sharing, wireless
sensor networks, and even human skin detection.
This paper proposed a new outlier detection
algorithm based on an enhancement of the LOF
algorithm. The proposed algorithm focused on
simplifying the step of finding the nearest neighbors.
Time and accuracy were chosen as the performance
metrics to assess the efficiency of the proposed
algorithm. In all tested cases, the proposed algorithm
outperformed the GridLOF algorithm in terms of speed.
The proposed algorithm was also used for image
correction by detecting and removing outliers in the
image.

Future work will focus on solving some issues that
arose while working with the GridLOF algorithm, such
as how to increase the radius of the circle around
each point and the limit on the number of iterations
before the nearest neighbors are obtained.